In the world of finance, changing databases is usually pretty rare. When you’re in charge of other people’s money — several trillion dollars of it, in the case of one of the banks discussed in this article — even small changes could represent major risks.
That’s why even today, many banks still run systems based on legacy relational databases such as Oracle, IBM Db2, and PostgreSQL — often the same databases they were using 10 or even 20 years ago.
But the tide is turning. Fortune 500 banks are increasingly finding that the demands of their modern, global user base and a shifting regulatory landscape make sticking with legacy systems riskier than modernizing their stack. Many of these banks are turning to CockroachDB, a distributed SQL database that offers many of the same advantages as their legacy RDBMS, without the risks and downsides.
In this article, we’ll look at the reasons two major financial institutions have opted to move from legacy systems to CockroachDB for their mission-critical workloads. These banks are both in the Fortune 50 and collectively they manage roughly $6 trillion in assets and process more than $10 trillion in payments every single day.
One of the major reasons that banks have started to move away from legacy systems is to mitigate the business risks posed by black swan events and other unplanned downtime. Traditional RDBMS like Oracle and PostgreSQL were not built with resilience in mind, and while the banks built systems to increase their survivability, unplanned downtime still happened — and it was costly.
For example, one Fortune 50 bank began moving workloads to CockroachDB after a storm knocked critical systems offline. At the time, the bank was running its applications from a mainframe that ran on grid power, with diesel generators as a backup power source. The storm knocked out power to the grid and buried the roads in snow, preventing diesel delivery trucks from getting through. As its fuel reserves ran down, the bank decided to fail over to a backup datacenter in another state — a manual process that took parts of its database (and thus its application) offline for roughly an hour.
Ultimately, the outage cost the bank in terms of both lost revenue and lost customer trust. The fact that CockroachDB — which they had been running on the side as a PoC — survived the outage automatically, with no downtime or manual intervention, was evidence enough that shifting to a modern, distributed SQL database was actually less risky than sticking with their legacy system.
Planned downtime, which is often required for updates and upgrades to legacy RDBMS, can be just as costly as unplanned downtime in the long run.
One Fortune 50 bank, for example, originally ran most of its systems on Oracle, but making software updates — which were often mandatory to maintain security compliance — generally required taking the database offline. That was a manual process that cost time, effort, and team morale, because to minimize the disruption to customers, updates were often scheduled in the middle of the night.
The same was true for schema updates, which are occasionally necessary as software evolves and the business grows. Implementing them was simply not possible without taking the database offline.
Performance gains were another reason bank representatives cited for their moves to CockroachDB. Running mission-critical applications that can perform at global scale often means implementing a multi-region architecture to locate data as close to users as possible and to increase fault tolerance. Legacy RDBMS were simply not built to do this, and attempts to make them multi-region often break down at scale.
For example, one Fortune 50 bank built an IAM authentication layer to connect to all of its various products and manage customer account creation and login. This system was originally built using Oracle GoldenGate on AWS, but when they replicated their data across several regions to improve fault tolerance, it introduced a tremendous amount of lag — from 30 seconds to as long as five minutes.
This resulted in a terrible user experience. For example, customers would create an account and then attempt to log in and be denied access because their account information hadn’t yet been propagated across all regions.
The bank knew that wouldn’t work. To meet customer demands, they needed lag that was measured in milliseconds, not minutes. They identified the DBMS as the performance chokepoint, and after investigating other options, discovered that CockroachDB could massively improve performance, with read lag of just 1 ms and write lag of 280 ms — orders of magnitude faster than the 30-second lag they were getting with Oracle.
(Note: These lag times are specific to that bank’s particular use case and AWS networking setup as of several years ago when they swapped this system over to CockroachDB.)
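For context, standing up that kind of multi-region topology in CockroachDB is largely declarative. Here’s a minimal sketch in Python using the stock psycopg2 driver; the connection string, database name, and region names are hypothetical stand-ins, not the bank’s actual configuration:

```python
# A minimal sketch of declaring a multi-region CockroachDB database.
# The DSN, database name, and region names below are hypothetical;
# in a real cluster the regions must match the nodes' locality flags.
import psycopg2

conn = psycopg2.connect(
    "postgresql://app_user@localhost:26257/auth_db?sslmode=disable"
)
conn.autocommit = True  # run each DDL statement in its own implicit transaction
cur = conn.cursor()

# Declare a home region, add more regions, then ask the database to
# place replicas so it can survive the loss of an entire region.
cur.execute('ALTER DATABASE auth_db SET PRIMARY REGION "us-east-1"')
cur.execute('ALTER DATABASE auth_db ADD REGION "us-west-2"')
cur.execute('ALTER DATABASE auth_db ADD REGION "eu-west-1"')
cur.execute('ALTER DATABASE auth_db SURVIVE REGION FAILURE')

conn.close()
```

The cross-region replication work that a GoldenGate pipeline does by hand is handled by the database itself once those goals are declared.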
Getting better performance, more resilience, and more scale out of legacy RDBMS is possible. But it’s a largely manual endeavor that comes with significant costs.
For banks in particular, the increased cost of those manual processes may be less important than the fact that increasing complexity also increases risk. When operationally complex systems fail, they tend to be difficult and time consuming to fix. Fortune 50 banks — all banks, really — thus have an inherent interest in systems that can achieve their ideal outcomes (high performance, high fault tolerance, strong consistency) without increasing operational complexity.
For example, “We do things to try and make performance be good for the end user by making copies of data,” says an Executive Director at one of the Fortune 50 banks, “but that’s a lot of complexity. CockroachDB, with global tables and nonvoting replicas […] can be used to do the same thing much more simply.”
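To make the global tables idea concrete: in CockroachDB, a table is marked global with a single statement, and the database maintains the extra copies (including nonvoting replicas for fast local reads) on its own. A hypothetical sketch, reusing the connection from the example above and an invented reference table:

```python
# Hypothetical sketch: mark a small, read-mostly reference table GLOBAL.
# CockroachDB then serves low-latency reads in every region via nonvoting
# replicas, replacing the hand-maintained copies the quote describes.
import psycopg2

conn = psycopg2.connect(
    "postgresql://app_user@localhost:26257/auth_db?sslmode=disable"
)
conn.autocommit = True
conn.cursor().execute("ALTER TABLE currency_rates SET LOCALITY GLOBAL")
conn.close()
```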
In some cases, major banks are also leaving legacy RDBMS because of a general drive to modernize their stack to offer a better customer experience, to remain in compliance with financial regulations, and to work better with the distributed, cloud-based microservices architectures that they have adopted to run their application logic.
Even banks that haven’t encountered costly black swan events or found application performance bottlenecked by their database, as the banks discussed in this article did, are aware that those are possibilities with legacy RDBMS, and on a long enough timeline, probably inevitabilities.
While there are risks associated with any kind of migration, increasingly banks recognize that those risks are significantly outweighed by the risks of continuing to rely on legacy technologies. After all, their competitors — the banks discussed in this article — are modernizing, and the switch to CockroachDB has given them some significant competitive advantages.
Moving to CockroachDB has had major upsides for both of the banks, and both have significantly expanded their internal use cases for CockroachDB since their initial onboarding. Before we get to the specifics of what they’re getting from CockroachDB, though, let’s take a quick look at how they’re using it.
Global banks like the ones discussed in this article have thousands of production databases, and they’re used for everything from mission-critical workloads for consumer applications — things like IAM, payment processing, deposits, etc. — to internal project management. No single piece of database software is right for every workload, and both banks use a variety of DBMS.
CockroachDB is a distributed SQL database that is particularly well suited to transactional use cases, especially at scale and across regions. It offers serializable isolation by default, and is an excellent choice for any workload that demands strong consistency. And because of its high level of fault tolerance — it can be configured to survive region failure, or even cloud failure if run multi- or hybrid-cloud — and the fact that software and schema updates can be run without taking the database offline, it’s also a popular choice for mission-critical workloads.
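The online schema change point is worth a concrete illustration, since it’s what eliminates the maintenance windows described earlier. A hedged sketch (the table and column are invented for the example); the ALTER runs in the background while the application keeps working:

```python
# Illustrative only: an online schema change against a live table.
# CockroachDB applies the change in the background; the table stays
# available for reads and writes the whole time.
import psycopg2

conn = psycopg2.connect(
    "postgresql://app_user@localhost:26257/bank?sslmode=disable"
)
conn.autocommit = True
conn.cursor().execute("ALTER TABLE accounts ADD COLUMN nickname STRING")
conn.close()
```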
The nature of banking dictates that banks have many workloads that tick all three of these boxes: mission-critical transactional workloads where consistency is an absolute necessity.
“Consistency is just huge for banking,” says an Executive Director at one of the Fortune 50 banks. “When you check your balance, do you expect it to be correct? You do, and you expect it to be up to date. That’s actually something that’s not that easy to do. CockroachDB offers a solution to that at scale that just enables that to happen much more effectively.”
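That balance guarantee falls out of serializable isolation. As a rough sketch, assuming a hypothetical accounts table with id and balance columns: a transfer written as an ordinary transaction can never expose the debit without the matching credit to a concurrent reader.

```python
# Sketch of a funds transfer under CockroachDB's default SERIALIZABLE
# isolation. The schema, IDs, and amount are invented for illustration.
import psycopg2

conn = psycopg2.connect(
    "postgresql://app_user@localhost:26257/bank?sslmode=disable"
)
try:
    with conn:  # psycopg2 commits on success, rolls back on exception
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                (100, 1),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                (100, 2),
            )
finally:
    conn.close()
```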
CockroachDB is thus used for a variety of workloads, from deposits to payment processing to IAM across the banks in question. In fact, at one of the banks, CockroachDB is offered as an internally managed service on their private cloud, so that any development team can easily spin up a cluster for whatever they’re working on.
CockroachDB allows banks to reduce their risk profile in a wide variety of ways: it survives outages automatically rather than requiring manual failover, it eliminates the planned downtime that legacy systems require for software and schema updates, and it keeps data strongly consistent even when replicated across regions.
CockroachDB makes growth and innovation easier for banks primarily by reducing operational complexity and decreasing the overall complexity of the system (from a developer perspective).
Growth with a legacy RDBMS like Oracle or PostgreSQL typically means manually sharding it to achieve the desired level of scale and, in the case of global banks, a multi-region deployment. But manual sharding is time consuming, as it entails (among other things) refactoring almost all application code that interacts with the database. This refactoring must be repeated every time the database has to be scaled up. And manual sharding results in a technically complex, brittle system that requires significant time and energy to operate.
All of these factors make working with legacy RDBMS at global scale an anchor for developers and operations staff. Updates require taking the database offline. Any updates to the application require significant consideration of how those updates will interact with the database. Problems, when they arise with mission-critical systems, will require all-hands-on-deck approaches that drain your team’s energy and (often) harm work/life balance.
Working with CockroachDB has enabled the banks to largely avoid these problems because while CockroachDB is itself complex, developers and ops personnel largely don’t have to interact with that complexity. From a developer perspective, CockroachDB can be treated almost exactly like a single-instance PostgreSQL database — manual partitioning, writing custom routing logic, and refactoring application code are not required.
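In practice, “treated like PostgreSQL” means connecting with an off-the-shelf Postgres driver and writing ordinary SQL; there is no CockroachDB-specific client. A minimal sketch, assuming a local node and hypothetical credentials:

```python
# Connect to CockroachDB with a stock PostgreSQL driver (psycopg2).
# Only the default port (26257) hints that the "single database" on
# the other end is actually a distributed cluster.
import psycopg2

conn = psycopg2.connect(
    "postgresql://app_user@localhost:26257/bank?sslmode=disable"
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])  # prints a CockroachDB version string
conn.close()
```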
Similarly, CockroachDB’s self-healing nature can make outages and other similar events much easier to deal with. One of the Fortune 50 banks regularly runs practice drills for black swan events as well as “repave” days to ensure their teams and systems are ready in the event of the unexpected. For teams that work with traditional RDBMS, an executive director at the bank says, “there’s a lot of overhead there.” But for the growing number of teams that use CockroachDB, “generally it’s a no-op. [They’ll] get some telemetry or some observability telling them that some nodes are down or a network is isolated, but there’s nothing for them to do” because the database survives and recovers automatically.
“When you’re doing it at scale,” he says, “[CockroachDB’s] operational simplicity has been very effective at helping our developers to not worry about those things, and just keep working on their business logic.” And that, in turn, helps the bank to grow and innovate more quickly.
CockroachDB also contributes to increasing the profitability of these banks, for many of the same reasons discussed above: less downtime means less lost revenue, operational simplicity frees engineers to focus on business logic, and better performance translates directly into a better customer experience.
And really, these benefits are just the tip of the iceberg.